A new approach to the optimization of the 
extraction of astrometric and photometric 
information from multi-wavelength images in 
cosmological fields. 
Abstract This paper describes a new approach to the optimization of information
extraction in multi-wavelength image cubes of cosmological fields.
The objective is to create a framework for the automatic identification and tagging
of sources according to various criteria (isolated source, partially overlapped, fully
overlapped, cross-matched, etc.) and to set the basis for the automatic production of
the SEDs (spectral energy distributions) for all objects detected in the many
multi-wavelength images in cosmological fields.
In order to do so, a processing pipeline is designed that combines Voronoi
tessellation, Bayesian cross-matching, and active contours to create a graph-based
representation of the cross-match probabilities. This pipeline produces a set of SEDs with
quality tags suitable for the application of already-proven data mining methods.
The pipeline briefly described here is also applicable to other astrophysical
scenarios such as star forming regions.

Single-field multi-wavelength studies obtained with very heterogeneous instruments
and telescopes are very common nowadays. Deep cosmological surveys are extreme
examples of such studies: they combine photometric data from γ-rays to radio
wavelengths, offering complementary yet astonishingly different views of the same
extragalactic objects. These image cubes carry both astrometric and photometric
information on tens of thousands of sources, which brings their analysis into the realm
of statistics and data mining.

One of the key aspects of the systematic analysis of these image cubes is the 
reliability of the scientific products derived from them. In this work, we concentrate 
on the generation of spectral energy distributions (SEDs) of extragalactic sources 
in deep cosmological fields. The techniques outlined here are nevertheless of much
wider application in other astrophysical scenarios. We concentrate in particular on
the problem of tagging the quality of a derived SED from the perspective of the
underlying cross-match decisions.

In Section 2 we describe the project and its aims, and the techniques utilized to
derive spectral energy distributions from deep cosmological image cubes; Section 3
briefly summarizes the Bayesian approach that serves as the basis for the
developments presented in Section 4, which introduces the possibility of non-detections
in the Bayesian formalism. Finally, Section 5 describes the results obtained for the
application of the extended formalism to a toy problem, and Section 6 summarizes
the main conclusions.



2 Deep cosmological fields: the analysis pipeline 

In this work we address the problem of deriving spectral energy distributions and
labelling the different sources detected in multi-wavelength deep images of
cosmological fields. It is composed of several sub-tasks, such as the cross-matching
of the sources detected in individual images, the tagging of potential overlaps and
the derivation of optimal regions for sky subtraction.

Images of the same field obtained with different spatial resolutions, sensitivities 
and in various wavelengths will offer complementary views of the same sources, but 
also views that can be inconsistent if we do not take into account all these factors. 
Consider, for example, a galaxy A that is detected in low-resolution infrared
bands, has a flux density below the detection threshold of a mid-infrared survey, and
has several potential counterparts at visible wavelengths, many of which do not
actually correspond to galaxy A but to galaxies close to the line of sight. Consider,
in addition, the possibility that one of the visible counterparts (but not
the one that corresponds to galaxy A) is actually detected in the mid-infrared
image. A sound cross-matching approach must address this problem in a
probabilistic manner, including a requirement of astrometric and photometric
consistency. The approach that we propose here is based on a Bayesian formulation of the
problem of cross-matching catalogues which, as a by-product, produces a quantitative
measure of the validity of the counterpart assignment and flags SEDs that may be
affected by source overlapping within and across images taken in several bands.

In the first stage of our analysis pipeline, the catalogue extraction tool
SExtractor [2] is applied to each image separately. The catalogue thus obtained
(including astrometric and photometric information) is used as the basis for a 2D Voronoi
(Delaunay) tessellation of the images that defines a polygon in the corresponding
coordinates (e.g., celestial, pixel) for each source.
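
The defining property of this tessellation can be illustrated with a minimal sketch: a pixel belongs to the Voronoi cell of the catalogue source nearest to it. The function names, the pixel-centre convention and the example coordinates below are assumptions made for illustration only; a production pipeline would more likely build the cell polygons explicitly (for instance with `scipy.spatial.Voronoi`) rather than label pixels by brute force.

```python
import math

def nearest_source(point, sources):
    """Index of the source whose Voronoi cell contains `point`."""
    return min(range(len(sources)), key=lambda k: math.dist(point, sources[k]))

def voronoi_label_map(width, height, sources):
    """Label every pixel centre with the index of the Voronoi cell it falls in."""
    return [
        [nearest_source((x + 0.5, y + 0.5), sources) for x in range(width)]
        for y in range(height)
    ]

# Hypothetical catalogue positions in pixel coordinates (x, y).
example_sources = [(1.0, 1.0), (6.0, 1.0), (3.0, 6.0)]
example_labels = voronoi_label_map(8, 8, example_sources)
```

The brute-force search costs one distance evaluation per pixel and source, which is acceptable for a small cut-out but not for a full deep-field image.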

This 2D Voronoi tessellation of the images provides us with a preliminary
categorization of sources into the candidate categories of isolated source and partially
or totally contaminated by neighbouring sources. A source is labelled as a candidate
isolated source if it is fully contained in its Voronoi cell and none of the sources
from the surrounding Voronoi cells contaminates it. In this initial stage, the source
extension is defined by its Kron ellipse [2], although subsequent refinements can be
applied with more refined contours (active contours, for example). This labelling
procedure only considers information from one single image. The definition can be
extended by defining an isolated source as one which i) is isolated in the
lowest-resolution image; ii) has only one counterpart in the projection of its Voronoi
cell in all other images; and iii) has counterparts that are each also isolated in the
sense defined above.
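
A minimal sketch of the single-image labelling is given below. It replaces the Kron ellipse with a circle of hypothetical radius, which is an assumption made only to keep the geometry short: a circle of radius r centred on a source lies inside its Voronoi cell exactly when r does not exceed half the distance to every other source. The contamination test is deliberately conservative, since it checks all other sources rather than only the Voronoi neighbours.

```python
import math

def label_isolation(positions, radii):
    """Label each source 'isolated' or 'contaminated' (single-image criterion).

    Sources are approximated by circles (an assumption; the text uses Kron
    ellipses). Source i is a candidate isolated source when its own circle
    stays inside its Voronoi cell and no other circle crosses the bisector
    that separates it from source i.
    """
    labels = []
    for i, (p, r_i) in enumerate(zip(positions, radii)):
        label = "isolated"
        for j, (q, r_j) in enumerate(zip(positions, radii)):
            if i == j:
                continue
            half_distance = 0.5 * math.dist(p, q)  # distance to the bisector
            if r_i > half_distance or r_j > half_distance:
                label = "contaminated"
                break
        labels.append(label)
    return labels

# Hypothetical positions (pixels) and radii (pixels).
example_labels = label_isolation([(0, 0), (10, 0), (11, 0)], [2.0, 1.0, 1.0])
```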

Figure 1 shows an example of the result of applying this preliminary
labelling process to the Hubble Deep Field image taken by the IRAC instrument in
the 3.6 μm channel.

A simple improvement of this approach consists in taking into account the source 
morphology in the determination of the isolation cell by applying Support Vector 
Machines for the determination of the maximum margin hyperplanes separating 
sources. 

The previous steps produce a set of two-dimensional vectors, $x_{ij}$, which
represent the celestial coordinates of source $j$ in catalogue $i$, together
with the preliminary labelling described in the previous paragraphs. From this set of
vectors, we aim to construct reliable SEDs by cross-matching them, taking into
account the astrometric information, the photometric information and the instrument
sensitivities. In the following, we summarize the Bayesian formalism developed
in [1], which we further extend to potential non-detections.

Fig. 1 Examples of isolated and partially contaminated sources in the Hubble Deep Field image
taken by the IRAC instrument in the 3.6 μm channel.



3 Cross-matching of multi-wavelength astronomical sources

The work presented in [1], and summarized in the following paragraphs, proposes
a Bayesian approach to the decision-making problem of defining counterparts in
multi-band image cubes.

Let us define M as the hypothesis that the position of a source is on the celestial
sphere, and let us parametrize this position in terms of a three-dimensional unit
vector $m$. Let us assume that we have $n$ overlapping images of a given field, and let
us call data $D = \{x_1, x_2, \dots, x_n\}$ the n-tuple composed of the locations of $n$
sources in the sky from the $n$ different channels or images. Then, two hypotheses can be
identified in this context:

• H: hypothesis that the positions in the n-tuple correspond to a single source. 

• K: hypothesis that the positions do not correspond to a single source. 

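Hypothesis H will be parametrized by a single common location $m$ and the
alternative hypothesis K will be parametrized by $n$ positions $\{m_i,\ i = 1, 2, \dots, n\}$.
Therefore:

$$p(D \mid H) = \int p(m \mid H) \prod_{i=1}^{n} p(x_i \mid m, H) \, d^3m \qquad (1)$$

$$p(D \mid K) = \prod_{i=1}^{n} \int p(m_i \mid K) \, p(x_i \mid m_i, K) \, d^3m_i \qquad (2)$$
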
Fig. 2 Example of the implementation of an iterative procedure for multi-wavelength cross-matching
in the Hubble Deep Field image.

In [1], Budavari et al. propose an iterative procedure based on the thresholding
of the Bayes factor computed from equations (1) and (2) for the identification of
counterparts in several catalogues. We have implemented this procedure and tested it
with five real catalogues (one catalogue from IRAC and four catalogues from
SUBARU). Figure 2 shows one example of this implementation. A threshold of $B_0 = 5$
was chosen to collect all possible candidates; the low-probability ones are
weeded out in subsequent steps. A single astrometric precision of $\sigma < 0.2$ has
been adopted for all catalogues.
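
The thresholding step can be sketched as follows. The closed form used here is the one commonly quoted for this astrometric Bayes factor when the priors are flat over the sky and the positional errors are narrow Gaussians; it is not reproduced in the present text, so it should be read as an assumption of this example: $B = 2^{n-1} \frac{\prod_i w_i}{\sum_i w_i} \exp\left(-\frac{\sum_{i<j} w_i w_j \psi_{ij}^2}{2 \sum_i w_i}\right)$, with $w_i = 1/\sigma_i^2$ and $\psi_{ij}$ the angular separation between detections $i$ and $j$, both in radians. The function names are hypothetical, and the astrometric precision is assumed to be expressed in arcseconds.

```python
import math
from itertools import combinations

ARCSEC = math.pi / (180.0 * 3600.0)  # one arcsecond in radians

def unit_vector(ra_deg, dec_deg):
    """Unit vector on the celestial sphere for a position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

def separation(u, v):
    """Angular separation in radians, numerically stable for small angles."""
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    sin_psi = math.sqrt(sum(c * c for c in cross))
    cos_psi = sum(a * b for a, b in zip(u, v))
    return math.atan2(sin_psi, cos_psi)

def log_bayes_factor(positions_deg, sigmas_arcsec):
    """Natural log of the astrometric Bayes factor for one n-tuple.

    Assumes flat priors and narrow Gaussian errors (see the lead-in).
    The logarithm is returned to avoid overflow for many catalogues.
    """
    weights = [1.0 / (s * ARCSEC) ** 2 for s in sigmas_arcsec]
    vectors = [unit_vector(ra, dec) for ra, dec in positions_deg]
    n = len(weights)
    w_sum = sum(weights)
    quad = sum(weights[i] * weights[j] * separation(vectors[i], vectors[j]) ** 2
               for i, j in combinations(range(n), 2))
    return ((n - 1) * math.log(2.0) + sum(math.log(w) for w in weights)
            - math.log(w_sum) - quad / (2.0 * w_sum))

def is_candidate(positions_deg, sigmas_arcsec, threshold=5.0):
    """Keep the n-tuple as a candidate association when B exceeds B_0."""
    return log_bayes_factor(positions_deg, sigmas_arcsec) > math.log(threshold)
```

With a low threshold such as $B_0 = 5$ the test is permissive by design: it collects candidates generously and leaves the rejection of poor associations to later stages.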



4 Extended Bayesian inference for the consideration of 
non-detection 

The possibility of non-detected sources has not been taken into account so
far in the formalism described above. In [1], Budavari et al. go one step further
by thresholding a combined Bayes factor that includes the astrometric and the
photometric Bayes factors. In their proposal, the photometric Bayes factor compares
two hypotheses: i) the photometric measurements of an n-tuple correspond to a
single model SED (where a choice of parametrized models is available for galactic
SEDs), or ii) they come from independent and different SEDs. This allows us in
general to reject a proposed cross-match, but does not help in refining it by excluding
inconsistent measurements. Here, we elaborate on that proposal in order to extract
the kind of information that may allow us to construct an SED even if it is incomplete.

Let us take as a starting point the (n+1)-tuples derived from the algorithm proposed
in [1], which uses only astrometric information. We define the
(n+1)-tuple as a set of potential counterparts to the source detected in the
lowest-resolution image, which drives the Voronoi tessellation in celestial coordinates
described in Section 2. Let us denote this image by $i = n + 1$ in the following.

In order to include the photometric information in the inference process, we
will assume that there exists a model for the galactic SED which is parametrized by
the set $\{\eta_k,\ k = 1, 2, \dots, K\}$. In [1], the authors parametrize each SED by a
discrete spectral type $T$, the redshift $z$ and an overall scaling factor for the
brightness, $\alpha$; an additional simplification, $\alpha = 1$, can be obtained here by
normalizing the SED.

It is important to note that each instrument has its own detection limit which
depends, in general and amongst other factors, on the spatial flux density of a source
and not on its total integrated flux. For the sake of simplicity, however, we will
only consider flux thresholds here instead of fully modelling the detection process;
the latter is always the correct approach, especially when dealing with extended sources.

The cross-matching problem described in Section 3 requires the ability to identify
the same source across different images taken with different instruments.
The possibility that a source under study is not detected in some of them has not been
taken into account so far in the model described in [1].

Let us take as a starting point N tuples of n+1 elements derived from the algorithm
proposed in [1].

For the sake of simplicity in this preliminary model, we assume the existence of one and
only one detected source in the channel which drives the Voronoi tessellation in celestial
coordinates described in Section 2; there will therefore always be
a detection in this channel.

Dealing with non-detections requires the use of photometric information, and
for that purpose the photometric model proposed in [1] will be used
and extended. As indicated in [6], a wealth of models has been created with the goal
of choosing and extracting useful information from SEDs. In our case we will follow
the same simple model for the SED as the one indicated in [1]. Let us consider the
data $D'$ as an (n+1)-tuple of the measured fluxes: $D' = \{g_1, g_2, \dots, g_{n+1}\}$.

The Bayesian inference for this photometric model will be run on the following
two mutually exclusive hypotheses:

• $H_1$: all the fluxes $g_i$ correspond to the same source.

• $K_1$: not all the fluxes $g_i$ correspond to the same source.



The evidence for hypothesis $H_1$ is:

$$p(D' \mid H_1) = \int p(\eta \mid H_1) \prod_{i=1}^{n+1} p_i(g_i \mid \eta, H_1) \, d^r\eta \qquad (3)$$



where:

• $\eta$ are the parameters for modelling the spectral energy distribution.

• $p(\eta \mid H_1)$ is the prior probability, which should be carefully chosen from one of
the models proposed in [6]; for example, the SWIRE database could be a good option
for IRAC catalogues.

• $p_i(g_i \mid \eta, H_1)$ is the probability that one source with SED parameters $\eta$ has a
measured flux of $g_i$.

For hypothesis $K_1$ we take into account the possibility of
having sources that are not detected in one or several channels. This means that
hypothesis $K_1$ contains a combinatorial number of sub-hypotheses (i.e. that the source has
not been detected in any possible combination of channels, and that the detections
in these channels correspond to nearby sources on the celestial sphere).

In this way, one new sub-hypothesis is established per combination found;
therefore there will be:

• $C_{n,1} = n$ sub-hypotheses for one non-detection.

• $C_{n,p} = \frac{n!}{p!\,(n-p)!}$ sub-hypotheses for $p$ non-detections.

• $C_{n,n} = 1$ sub-hypothesis for $n$ non-detections.

The formalism proposed here for the hypothesis $K_1$ will include all the
independent sub-hypotheses described before.

Let $P_{n,p} = \{L_{\{l_1, \dots, l_p\}}\}$ be the set of sub-hypotheses with $p$ non-detections
and with $n - p$ detections.
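
The counting above can be checked with a short enumeration. The sketch below lists, for each number of non-detections p, every set of non-detected channels among the n channels other than the reference one; the function name and the 1-based channel numbering are choices made for this illustration.

```python
from itertools import combinations
from math import comb

def sub_hypotheses(n):
    """Map p -> list of channel sets with exactly p non-detections.

    Channels are numbered 1..n; the reference channel n+1 is excluded
    because it is assumed to always contain a detection.
    """
    channels = range(1, n + 1)
    return {p: list(combinations(channels, p)) for p in range(1, n + 1)}

example = sub_hypotheses(5)
example_counts = [len(example[p]) for p in range(1, 6)]
example_total = sum(example_counts)
```

For n = 5, as in a 6-tuple, this gives 5, 10, 10, 5 and 1 sub-hypotheses for p = 1 to 5, that is 2^5 - 1 = 31 in total, so exhaustive evaluation is cheap for a handful of bands but grows exponentially with the number of channels.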




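The generic expression for hypothesis $K_1$, taking into account all the possibilities of
non-detections from an (n+1)-tuple, is as follows:

$$p(D' \mid K_1) = \prod_{i=1}^{n+1} \int p(\eta_i \mid L) \, p_i(g_i \mid \eta_i, L) \, d^r\eta_i
+ \sum_{p=1}^{n} \sum_{L \in P_{n,p}} \prod_{i \in \{l_1, \dots, l_p\}} \left\{ \int \int_{-\infty}^{\tau_i} p(\eta \mid L) \, p_i(g \mid \eta, L) \, dg \, d^r\eta \right\} \cdot \prod_{j \neq l_m} \left\{ \int p(\eta \mid L) \, p_j(g_j \mid \eta, L) \, d^r\eta \right\} \qquad (4)$$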



where the non-detection of source $i$ can be modelled as the area below the
detection threshold, $\tau_i$, of a Gaussian distribution.

The evidence for hypothesis $K_1$, as expressed in equation (4), includes the
combinatorial number of exclusive sub-hypotheses presented before. In this way, an
unambiguous description of each specific combination of non-detection(s) among
the channels of the (n+1)-tuple is feasible.
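
Written out explicitly, and assuming (this is not stated in the text) that the Gaussian is centred on the flux $f_i(\eta)$ predicted by the SED model in channel $i$ and has a width $\sigma_i$ equal to the measurement uncertainty, the non-detection term reduces to a complementary error function:

```latex
P(\text{non-detection in channel } i \mid \eta)
  = \int_{-\infty}^{\tau_i} \frac{1}{\sqrt{2\pi}\,\sigma_i}
    \exp\!\left[-\frac{\bigl(g - f_i(\eta)\bigr)^2}{2\sigma_i^2}\right] dg
  = \frac{1}{2}\,\operatorname{erfc}\!\left(\frac{f_i(\eta) - \tau_i}{\sqrt{2}\,\sigma_i}\right)
```

A source whose predicted flux sits exactly at the threshold is thus missed half of the time, while one lying several $\sigma_i$ above it is almost never missed.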

The use of a separate Bayes factor for each sub-hypothesis allows the
identification of the most favourable model; alternatively, other statistics such as
Bayesian Model Averaging (BMA) can also provide an assessment of how probable a
model is given the data, conditional on the set of models considered,
$L_1, \dots, L_p, \dots, L_n$, with $L_p$ being the set of sub-hypotheses corresponding to
$p$ non-detections. Initially we would assign the same prior probability to each
sub-hypothesis.
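
As a minimal sketch of this model comparison, the posterior probability of each sub-hypothesis, conditional on the set considered, follows from its evidence and its prior weight. The function name, the dictionary interface and the numerical values are hypothetical; equal priors are used by default, as suggested above, and evidences are handled in logarithmic form to avoid underflow.

```python
import math

def posterior_model_probabilities(log_evidences, priors=None):
    """Posterior probability of each model, conditional on the models listed.

    log_evidences maps a model name to ln p(D | model). When priors is None,
    every model receives the same prior probability.
    """
    names = list(log_evidences)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    log_post = {name: log_evidences[name] + math.log(priors[name]) for name in names}
    shift = max(log_post.values())  # log-sum-exp trick for numerical stability
    norm = sum(math.exp(value - shift) for value in log_post.values())
    return {name: math.exp(value - shift) / norm for name, value in log_post.items()}

# Hypothetical log-evidences for three sub-hypotheses.
example_probs = posterior_model_probabilities(
    {"L_3": 0.0, "L_1": math.log(0.25), "L_13": math.log(0.25)}
)
```

These probabilities are only meaningful relative to the models listed: adding or removing a sub-hypothesis changes all of them.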



5 Toy example 

Let us model the radiation of a black body using Planck's law. This function depends
on the frequency $\nu$ and the temperature $T$:

$$I(\nu, T) = \frac{2 h \nu^3}{c^2} \, \frac{1}{e^{h\nu / kT} - 1} \qquad (5)$$

Let us consider a set of measurements $g_i$ of the black body intensities $I(\nu, T)$;
for a 6-tuple we will have the following data: $D' = \{g_1, \dots, g_6\}$. For the prior
we will use a flat function as a first approximation, and we will assume a Gaussian
distribution for the measurement uncertainties, $p_i(g_i \mid T)$. Note that in this case the
example has been constructed in such a way that the measurement in channel 3, $g_3$,
does not correspond to the cross-matched source.

Applying equations (3) and (4) we obtain the following Bayes factor:

$$B = \frac{p(D' \mid H_1)}{p(D' \mid K_1)} = 1.92 \cdot 10^{-2} \qquad (6)$$

Therefore the model $K_1$ is clearly favoured over the model $H_1$. We
can go a step further by applying the extended Bayesian formalism presented here,
from which the Bayes factors of all possible sub-hypotheses are obtained, with
sub-hypothesis $L_{\{3\}}$ emerging as the most favourable one, as expected.
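
A toy calculation of this kind can be sketched as follows. Every numerical choice is an assumption made for illustration: the six frequencies, the 5000 K source, the flux noise, the detection thresholds, the flux placed in channel 3, the 2000 to 12000 K flat prior, and the way a contaminating flux is scored (as an unrelated black body drawn from the same prior). In this sketch the sub-hypotheses are also averaged with equal weights and all surviving channels share one temperature, which may differ from equation (4). It is therefore not expected to reproduce the value in equation (6), only the qualitative outcome.

```python
import math
from itertools import combinations

H = 6.62607015e-34   # Planck constant [J s]
K_B = 1.380649e-23   # Boltzmann constant [J/K]
C = 2.99792458e8     # speed of light [m/s]
SCALE = 1e8          # arbitrary flux unit so that intensities are of order unity

def planck(nu, temp):
    """Scaled black-body intensity I(nu, T), equation (5)."""
    return SCALE * 2.0 * H * nu ** 3 / C ** 2 / math.expm1(H * nu / (K_B * temp))

def gauss(x, mu, sigma):
    """Gaussian probability density of a measured flux x around a model flux mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def below_threshold(mu, tau, sigma):
    """Probability that a Gaussian measurement centred on mu falls below tau."""
    return 0.5 * math.erfc((mu - tau) / (sigma * math.sqrt(2.0)))

# Assumed set-up: six channels, the last one being the reference channel.
NU = [k * 1e14 for k in range(1, 7)]                 # frequencies [Hz]
DT = 5.0                                             # temperature grid step [K]
T_GRID = [2000.0 + DT * k for k in range(2001)]      # flat prior on 2000..12000 K
PRIOR = 1.0 / (T_GRID[-1] - T_GRID[0])
SIGMA = 0.05                                         # flux uncertainty, all channels
TAU = [0.3, 0.3, 3.0, 0.3, 0.3, 0.3]                 # channel 3 is much shallower
MODEL = [[planck(nu, t) for nu in NU] for t in T_GRID]

def marginal(channel, flux):
    """Evidence for one flux produced by an unrelated black body (assumption)."""
    return sum(PRIOR * gauss(flux, row[channel], SIGMA) for row in MODEL) * DT

def evidence(fluxes, missed=()):
    """Evidence when the source is undetected in the channels listed in `missed`."""
    total = 0.0
    for row in MODEL:                                # Riemann sum over temperature
        like = PRIOR
        for i, g in enumerate(fluxes):
            if i in missed:
                like *= below_threshold(row[i], TAU[i], SIGMA)
            else:
                like *= gauss(g, row[i], SIGMA)
        total += like * DT
    for i in missed:                                 # fluxes left to nearby sources
        total *= marginal(i, fluxes[i])
    return total

def toy_example():
    """Return the Bayes factor B = p(D'|H1) / p(D'|K1) and the best sub-hypothesis."""
    fluxes = [planck(nu, 5000.0) for nu in NU]
    fluxes[2] = 4.0                                  # channel 3: flux of a neighbour
    ev_h1 = evidence(fluxes)
    subs = {s: evidence(fluxes, s)
            for p in range(1, 6) for s in combinations(range(5), p)}
    ev_k1 = sum(subs.values()) / len(subs)           # equal weights (assumption)
    return ev_h1 / ev_k1, max(subs, key=subs.get)
```

Calling `toy_example()` returns the Bayes factor together with the tuple of zero-based indices of the channels flagged as non-detections in the most favoured sub-hypothesis; with the set-up above the latter is channel 3 alone, mirroring the behaviour described in the text.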



[Figure 3: bar chart titled "Bayes Factors"; logarithmic vertical axis from 1 down to 1.0E-13; horizontal axis labelled "No detections".]

Fig. 3 Bayes Factors for all sub-hypotheses included in the hypothesis $K_1$



6 Conclusions 

The proposed extended Bayesian formalism for the probabilistic cross-matching
problem drives the identification of the most favourable model among many when
all the possible exclusive combinations of non-detected sources within the
(n+1)-tuple are taken into account. This stage leads to a natural refinement phase in
the construction of consistent SEDs, allowing a more precise labelling process for
sources detected in multi-wavelength deep images of cosmological fields.